Teaching Computers to See Patterns in Scatterplots with Scagnostics

Abstract:

As the number of dimensions in a dataset increases, the process of visualising its structure and variable dependencies becomes more tedious. Scagnostics (scatterplot diagnostics) are a set of visual features that can be used to identify interesting and abnormal scatterplots, and thus give a sense of priority to the variables we choose to visualise. Here, we discuss the creation of the cassowaryr R package, which provides a user-friendly method to calculate these scagnostics, as well as the development of adjusted measures not previously defined in the literature. The package is tested on datasets with known interesting visual features to ensure the scagnostics are working as expected, before being applied to time series, physics and AFLW data to show their value as a preliminary step in exploratory data analysis.

Harriet Mason https://www.britannica.com/animal/quokka (Monash University) , Stuart Lee https://stuartlee.org (Genentech) , Ursula Laa https://uschilaa.github.io (University of Natural Resources and Life Sciences) , Dianne Cook https://dicook.org (Monash University)

Introduction

Visualising high dimensional data is often difficult and requires a trade-off between the usefulness of the plots and maintaining the structures of the original data. This is because the number of possible pairwise plots grows quadratically with the number of dimensions: p variables give p(p-1)/2 distinct pairs. Datasets like Anscombe’s quartet (Anscombe 1973) or the datasaurus dozen (Locke and D’Agostino McGowan 2018) have been constructed such that each pairwise plot has the same summary statistics but strikingly different visual features. This design illustrates the pitfalls of numerical summaries and the importance of visualisation. This means that despite the issues that come with increasing dimensionality, visualisation of the data cannot be ignored. Scagnostics offer one possible solution to this issue.

The term scagnostics was introduced by John Tukey in 1982 (Tukey 1988). Tukey discusses the value of a cognostic (a diagnostic that should be interpreted by a computer rather than a human) to filter out uninteresting visualisations. He calls a cognostic that is specific to scatter plots a scagnostic. Up to a moderate number of variables, a scatter plot matrix (SPLOM) can be used to create pairwise visualisations; however, this solution quickly becomes infeasible as the number of variables grows. Thus, instead of trying to view every possible variable combination, the workload is reduced by calculating a series of visual features, and only presenting the scatter plots that are outliers on these feature combinations.

There is a large amount of research into visualising high dimensional data, most of which focuses on some form of dimension reduction. This can be done by creating a hierarchy of potential variables, performing a transformation of the variables, or some combination of the two. Unfortunately none of these methods is without pitfalls. Linear transformations are subject to crowding, where low-dimensional projections concentrate data in the centre of the distribution, making it difficult to differentiate data points (Diaconis and Freedman 1984). Non-linear transformations often have complex parameterisations, and can break the underlying global structure of the data, creating misleading visualisations. There are solutions within these methods to fix these issues, such as a burning sage tour, which zooms in further on points closer to the middle of a tour to prevent crowding (U. Laa, Cook, and Lee 2020), or the liminal package, which facilitates linked brushing between non-linear and linear data transformations to maintain global structure (Lee, Laa, and Cook 2020). However, all these methods still involve some transformation of the data. Scagnostics give the benefit of allowing the user to view relationships between the variables in their raw form. This means they are not subject to the linear transformation issue of crowding, or the non-linear transformation issue of misleading global structures. That being said, only viewing pairwise plots can leave our variable interpretations without context. Methods such as those shown in ScagExplorer (Dang and Wilkinson 2014) try to address this by visualising the pairwise plots in relation to the distribution of the scagnostic measures, but ultimately the lack of context remains one of the limitations of using scagnostics alone as a dimension reduction technique.

Scagnostics are not only useful in isolation; they can also be applied in conjunction with other techniques to find interesting feature combinations of the transformed variables. The tourr projection pursuit currently uses a selection of scagnostics to identify interesting low-dimensional projections and move the visualisation towards them (U. Laa and Cook 2020). Since scagnostics are not dependent on the type of data, they can also be used to compare and contrast scatter plots regardless of the discipline. In this way, they are a useful metric for something like the comparisons described in A self-organizing, living library of time-series data, which tries to organise time series by their features instead of by their metadata (Fulcher et al. 2020).

Several scagnostics have been previously defined in Graph-Theoretic Scagnostics (L. Wilkinson, Anand, and Grossman 2005), which are typically considered the basis of the visual features. They were all constructed to range over [0,1], and later scagnostics have maintained this scale. The formulas for these measures were revised in Scagnostic Distributions and are still calculated according to this paper (Leland Wilkinson and Wills 2008). In addition to the main nine, the benefit of using two additional association scagnostics was discussed in Katrin Grimm’s PhD thesis (Grimm 2016). These two association measures are also used in the tourr projection pursuit (U. Laa and Cook 2020).

There are two existing scagnostics packages, scagnostics (Leland Wilkinson and Wills 2008) and the archived package binostics (Ursula Laa et al. 2020). Both are based on the original C++ code from Scagnostic Distributions (Leland Wilkinson and Wills 2008), which is difficult to read and difficult to debug. Thus there is a need for a new implementation that enables better diagnosis of the scagnostics, and better graphical tools for examining the results.

This paper describes the R package cassowaryr, which computes the currently existing scagnostics and adds several new measures. The paper is organised as follows. The next section explains the scagnostics. This is followed by a description of the implementation. Several examples using collections of time series and XXX illustrate the usage.

Scagnostics

Building blocks for the graph-based metrics

In order to capture the visual structure of the data, graph theory is used to calculate most of the scagnostics. The pairwise scatter plot is re-constructed as a graph with the data points as vertices, and the edges are calculated using Delaunay triangulation. In the package, this calculation is done using the alphahull package (Pateiro-Lopez, Rodriguez-Casal, and. 2019) to construct an object called a scree. This is the basis for all the other objects that are used to calculate the scagnostics (except for monotonic, dcor and splines, which use the raw data). The graph (scree object) is then used to construct the three key structures on which the scagnostics are based: the convex hull, alpha hull and minimum spanning tree (MST) (Figure 1).

Figure 1: The building blocks for graph-based scagnostics
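To make two of these building blocks concrete, the following is a minimal Python sketch, not the package's R implementation. It builds the MST with Prim's algorithm on the complete Euclidean graph (which yields the same tree as an MST restricted to the Delaunay edges whenever the MST is unique) and the convex hull with Andrew's monotone chain. The function names and the six example points are illustrative assumptions, and the alpha hull is omitted.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns a list of (i, j, length) tuples, one per MST edge."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the vertex outside the tree that is cheapest to attach
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

# Illustrative data: a unit square, its centre, and one point far to the right.
points = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5), (3, 0.5)]
tree = mst_edges(points)
hull = convex_hull(points)
```

In this example the centre point lies inside the hull, and the longest MST edge is the one that attaches the far point, which is the kind of edge the MST-based measures react to.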

Graph-based scagnostics

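The nine scagnostics defined in Scagnostic Distributions are detailed below with an explanation, formula, and visualisation. We will let A = alpha hull, C = convex hull, M = minimum spanning tree, and s = the scagnostic measure. For the MST-based measures, \(V\) is the set of vertices of M, \(V^{(k)}\) is the set of vertices with exactly k adjacent edges, \(e\) denotes an edge, and \(q_{p}\) is the p-th percentile of the MST edge lengths. Since some of the measures have some sample size dependence, we will let w be a constant that adjusts for that.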

\[s_{convex}=w\frac{area(A)}{area(C)}\]

Figure 2: Convex Scagnostic Visual Explanation

\[s_{skinny}= 1-\frac{\sqrt{4\pi area(A)}}{perimeter(A)}\]

Figure 3: Skinny Scagnostic Visual Explanation
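As a rough illustration of the skinny formula, the sketch below evaluates it on simple polygons. This is an assumption made for the example only: the alpha hull is not a polygon in general, so the polygons here merely stand in for A. The function names are hypothetical.

```python
import math

def polygon_area_perimeter(vertices):
    """Shoelace area and perimeter of a simple polygon given as ordered (x, y) vertices."""
    area2 = 0.0
    perimeter = 0.0
    closed = vertices[1:] + vertices[:1]   # pair each vertex with its successor
    for (x1, y1), (x2, y2) in zip(vertices, closed):
        area2 += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(area2) / 2, perimeter

def skinny(vertices):
    """1 - sqrt(4 * pi * area) / perimeter, evaluated on a polygon."""
    area, perimeter = polygon_area_perimeter(vertices)
    return 1 - math.sqrt(4 * math.pi * area) / perimeter

# A near-circle, a square, and a long thin rectangle.
circle = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
          for k in range(360)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (100, 0), (100, 1), (0, 1)]

for name, shape in [("circle", circle), ("square", square), ("sliver", sliver)]:
    print(name, round(skinny(shape), 3))
```

The ratio inside the formula equals 1 for a circle, so the measure is close to 0 for round shapes and approaches 1 as the shape becomes long and thin.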

\[s_{outlying}=\frac{length(M_{outliers})}{length(M)}\]

Figure 4: Outlying Scagnostic Visual Explanation

\[s_{stringy} = \frac{|V^{(2)}|}{|V|-|V^{(1)}|}\]

Figure 5: Stringy Scagnostic Visual Explanation
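The stringy formula only needs the degree of each MST vertex. A minimal sketch, assuming the MST is supplied as a list of vertex-index pairs (the function name and the toy trees are illustrative, not taken from the package):

```python
from collections import Counter

def stringy(edges):
    """|V2| / (|V| - |V1|), where Vk is the set of vertices with k adjacent MST edges."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    n_vertices = len(degree)
    v1 = sum(1 for d in degree.values() if d == 1)
    v2 = sum(1 for d in degree.values() if d == 2)
    if n_vertices == v1:
        # a single edge has only leaf vertices, so the ratio is undefined
        raise ValueError("stringy needs a tree with at least three vertices")
    return v2 / (n_vertices - v1)

path = [(0, 1), (1, 2), (2, 3), (3, 4)]   # one long string
star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # a hub with four leaves

print(stringy(path))  # 1.0
print(stringy(star))  # 0.0
```

A path graph scores 1 because every interior vertex has degree 2, while a star scores 0 because its only non-leaf vertex has degree 4.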

\[s_{skewed} = 1-w(1-\frac{q_{90}-{q_{50}}}{q_{90}-q_{10}})\]

Figure 6: Skewed Scagnostic Visual Explanation

\[s_{sparse}= wq_{90}\]

Figure 7: Sparse Scagnostic Visual Explanation
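The skewed and sparse measures both reduce to percentiles of the MST edge lengths. The sketch below is illustrative only: the weight w is left as a parameter defaulting to 1 because its form is not given here, the percentile interpolation rule (Python's "inclusive" method) is an assumption, and the edge lengths are made-up numbers that have not been rescaled, so sparse is not confined to [0, 1] in this example.

```python
from statistics import quantiles

def edge_deciles(lengths):
    """Return the 10th, 50th and 90th percentiles of the edge lengths."""
    cuts = quantiles(lengths, n=10, method="inclusive")   # nine cut points
    return cuts[0], cuts[4], cuts[8]

def skewed(lengths, w=1.0):
    """1 - w * (1 - (q90 - q50) / (q90 - q10))."""
    q10, q50, q90 = edge_deciles(lengths)
    return 1 - w * (1 - (q90 - q50) / (q90 - q10))

def sparse(lengths, w=1.0):
    """w * q90."""
    return w * edge_deciles(lengths)[2]

even = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]        # symmetric edge lengths
long_tail = [1, 1, 1, 1, 1, 1, 2, 3, 5, 9, 20]    # a few very long edges

print(skewed(even), skewed(long_tail), sparse(even))
```

With w = 1 the symmetric lengths give a skewed value of 0.5, and the long-tailed lengths give 1 because the median coincides with the 10th percentile.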

\[s_{clumpy}=\max_{j}\left[1-\frac{\max_{k}[length(e_k)]}{length(e_j)}\right]\]

Figure 8: Clumpy Scagnostic Visual Explanation

\[s_{striated}=\frac{1}{|V|}\sum_{v \in V^{(2)}}I(\cos\theta_{e(v,a)e(v,b)}<-0.75)\]

Figure 9: Striated Scagnostic Visual Explanation
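The striated formula counts degree-2 MST vertices whose two edges are close to collinear. A minimal sketch that follows the formula as printed, including the division by the total number of vertices; the function name, the input layout (points plus index-pair edges) and the toy trees are assumptions made for the example:

```python
import math
from collections import defaultdict

def striated(points, edges, threshold=-0.75):
    """Share of vertices that have exactly two MST edges meeting with cos(angle) < threshold."""
    neighbours = defaultdict(list)
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    count = 0
    for v, nbrs in neighbours.items():
        if len(nbrs) != 2:
            continue   # only degree-2 vertices enter the sum
        a, b = nbrs
        ax, ay = points[a][0] - points[v][0], points[a][1] - points[v][1]
        bx, by = points[b][0] - points[v][0], points[b][1] - points[v][1]
        cos_theta = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        if cos_theta < threshold:
            count += 1
    return count / len(points)

chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
line_pts = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]     # every interior angle is 180 degrees
stair_pts = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]    # every interior angle is 90 degrees

print(striated(line_pts, chain))   # 0.6
print(striated(stair_pts, chain))  # 0.0
```

The threshold of -0.75 admits angles from roughly 139 to 221 degrees, which is why right-angle turns contribute nothing.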

Association-based scagnostics

\[s_{monotonic} = r^2_{spearman}\]

Figure 10: Monotonic Scagnostic Visual Explanation
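The monotonic measure is the squared Spearman correlation, which is the squared Pearson correlation of the ranks. A standard-library sketch follows; the helper names are illustrative, and ties are given average ranks, which is the usual convention but an assumption here.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def monotonic(x, y):
    """Squared Spearman correlation."""
    return pearson(average_ranks(x), average_ranks(y)) ** 2

x = [1, 2, 3, 4, 5]
print(monotonic(x, [v ** 3 for v in x]))       # 1.0: increasing but non-linear
print(monotonic(x, [-v for v in x]))           # 1.0: decreasing
print(monotonic([1, 2, 3, 4], [1, 3, 2, 4]))   # about 0.64
```

Because the correlation is squared, perfectly increasing and perfectly decreasing relationships both score 1.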

The two additional scagnostics discussed by Katrin Grimm are described below.

\[s_{splines}=\max_{i\in x,y}[1-\frac{Var(Residuals_{model~i=.})}{Var(i)}]\]

Figure 11: Splines Scagnostic Visual Explanation

\[s_{dcor}= \sqrt{\frac{\mathcal{V}(X,Y)}{\sqrt{\mathcal{V}(X,X)\mathcal{V}(Y,Y)}}}\]
where \[\mathcal{V} (X,Y)=\frac{1}{n^2}\sum_{k=1}^n\sum_{l=1}^nA_{kl}B_{kl}\]
where \[A_{kl}=a_{kl}-\bar{a}_{k.}-\bar{a}_{.l}+\bar{a}_{..}\] \[B_{kl}=b_{kl}-\bar{b}_{k.}-\bar{b}_{.l}+\bar{b}_{..}\]
with \(a_{kl}=|x_k-x_l|\) and \(b_{kl}=|y_k-y_l|\) the pairwise distances within each variable.

Figure 12: Dcor Scagnostic Visual Explanation
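For reference, a small standard-library sketch of sample distance correlation in its usual form: each distance matrix is double-centred (row and column means subtracted, grand mean added back), the squared distance covariance is the mean of the element-wise product, and the result is normalised by the square root of the product of the two squared distance variances. The function names are illustrative and the loops are quadratic in n, so this is only meant for small examples.

```python
import math

def centred_distances(values):
    """Double-centred matrix of pairwise absolute differences."""
    n = len(values)
    d = [[abs(values[k] - values[l]) for l in range(n)] for k in range(n)]
    row = [sum(r) / n for r in d]   # the matrix is symmetric, so column means equal row means
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

def dcov2(A, B):
    """Squared sample distance covariance: mean of A[k][l] * B[k][l]."""
    n = len(A)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n ** 2

def dcor(x, y):
    A, B = centred_distances(x), centred_distances(y)
    denom = math.sqrt(dcov2(A, A) * dcov2(B, B))
    if denom == 0:
        return 0.0   # one of the variables is constant
    return math.sqrt(max(dcov2(A, B), 0.0) / denom)

x = [-2, -1, 0, 1, 2]
print(dcor(x, [2 * v + 1 for v in x]))   # close to 1 for an exact linear relation
print(dcor(x, [v ** 2 for v in x]))      # strictly between 0 and 1
```

The second example is a parabola on symmetric x values, where the Pearson correlation is zero but the distance correlation is not.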

Checking the scagnostics calculations

Once we have working functions that correctly calculate the scagnostics according to their definition, we can assess how well they identify the visual features of scatter plots. To test the package's ability to differentiate plots, we have created a dataset called “features” (that is also in the cassowaryr package) that contains a series of interesting and unique scatter plots which we can run our scagnostics on.

Figure: The Scatter Plots of the Features Dataset

These scatter plots typify certain visual features we want to look for in scatter plots. Be it deterministic relationships (such as that shown in the nonlinear feature), discreteness in variables (vlines), or clustering (clumpy), we should be able to use scagnostics to identify each of these scatter plots. Below is a visual table of an example of a high, a moderate, and a low value on each scagnostic. The scagnostics are supposed to range from 0 to 1; however, in some cases the values are so compressed that a moderate value would not fit, indicating that the scagnostics do not work quite as intended. We suspect the reason for these warped distributions is the removal of binning as a preliminary step in calculating the scagnostics. We wanted the package to have binning as an optional method, considering that choices in binning can lead to bias, as noted in “Scagnostic Distributions” (Leland Wilkinson and Wills 2008), or unreproducible results, as noted in “Improving the Robustness of Scagnostics” (Wang et al. 2020). Therefore the current scagnostics will be assessed without binning.

Figure: The Features Scatterplots in a Visual Table

This plot gives a slight idea of the issues some of the scagnostics face in their current state. The scagnostics based upon the convex hull (i.e. skinny and convex) work fine, as do the association measures monotonic, dcor and splines. The main issue comes from the measures based on the MST, and their issues largely come from the removal of binning. The MST measures and their issues are:

With these issues in mind, we have defined and written several new scagnostics that work even without the pre-processing step of binning.

The Adjusted Scagnostics Measures

The measures that need an adjusted version are striated, sparse, skewed, and clumpy. The outlying and stringy measures could possibly be left as they are, as they are not as badly damaged by the removal of binning.

Striated Adjusted

The issues surrounding the striated scagnostic are:

  1. By only counting vertices with 2 edges, the set of vertices counted in this measure is a subset of those counted in stringy, thus the two measures are highly correlated.

  2. In order for the vertex to be counted, the angle between the edges needs to be approximately 135 to 220 degrees. The original idea seems to have been to identify the predominantly 180 degree angles that come with a discrete variable plotted against a continuous one, however the large margin of error just makes the measure almost identical to stringy.

To account for these two issues the striated adjusted measure considers all vertices (not just those with two adjacent edges), and makes the measure strict around the 180 and 90 degree angles. With this we can see the improvements on the measure.

Figure: A Visual Table Comparison of Striated and Striated 2

While these two measures may seem similar at a glance, there are a few minor things that make the striated2 scagnostic an improvement on the striated scagnostic. First of all, the perfect 1 value on striated goes to the “line” scatter plot. While this does fulfil the definition, it is not what the measure is supposed to be looking for; rather, it is supposed to identify the “vlines” scatter plot. Since striated does not count the right angles that go between the vertical lines, a truly striated plot will never get a full 1 on this measure. Striated adjusted fixes this. After that there is a large gap in both measures because none of the other scatter plots have a strictly discrete measure on the x or y axis. The lower plots show that striated2 is also better at identifying discrete relationships with rotation and noise added, as shown in the “discrete” plot. In striated, “discrete” is lower in the order than “outlying”, which would indicate that striated has finished looking at discreteness. In striated2, after the plots with strict discreteness in “vlines” or strict rotated discreteness in “line” comes the noisy and rotated “discrete” plot. Therefore, in terms of ordering the plots by how well they represent the feature of discreteness, striated2 outperforms striated.

The scagnostics need to be used and interpreted with the type of dataset you are working with in mind. If we are looking at a dataset that is discrete, a very low value on striated2 would indicate some strange relationship in the scatter plot. Since the old striated measure is specifically trying to find a continuous variable against a discrete variable, its highest values are also identified by striated2. The lowest values on striated simply identify a plot where all the edges meet at right angles, once again a form of discreteness, but one that striated does not identify as such. Striated2 encapsulates both versions of discreteness in the values that get exactly a 1.

Clumpy Adjusted

The issues that need to be addressed with the new clumpy measure are:

  1. It needs to consider more than 1 edge in its final measure to make the measure more robust

  2. The impact of the ratio between the long and short edges needs to be weighted by the size of their clusters so the measure does not simply identify outliers

  3. It should not consider vertices whose adjacent edges form a straight line (to avoid identifying the angles striated identifies)

Before we calculated a new clumpy measure, we looked into applying a different adjustment, defined in Improving the Robustness of Scagnostics, that is a robust version of the original clumpy measure (Wang et al. 2020). This version of clumpy has been included in the package as “clumpy_r”; however, it is not included as an option in the higher level functions such as calc_scags() because its computation time is too long. This measure tries to build multiple clusters, each having their own clumpy value, and then returns the weighted sum, where each value is weighted by the number of observations in that cluster. This version of clumpy spreads the scatter plots more evenly between 0 and 1 and is more robust to outliers; however, without the assistance of binning it does a poor job of ordering plots generally considered to be clumpy. Since this scagnostic cannot be used in large scale scagnostic calculations (such as those done on every pairwise combination of variables, as is intended by the package) and it maintains the ordering issue from the original measure, it is not discussed here.

Therefore in order to fix the issues in the clumpy measure described above, we designed an adjusted clumpy measure, called clumpy2 in the package, and it is calculated as follows:

  1. Sort the edges in the MST by length
  2. Calculate the difference between adjacent edge lengths in this ordering, and find the index of the maximum. This maximum difference should indicate the jump from within-cluster edges to between-cluster edges.
  3. Remove the between-cluster edges from the MST and build clusters using the remaining edges
  4. For each between-cluster edge, take the smaller cluster (in number of observations) and take its median edge length. The clumpy value for that edge is the ratio between the large and small edge lengths \(\frac{edge_{large}}{edge_{small}}\), with two multiplicative penalties: one for uneven clusters \(\frac{2\times n_{small}}{n_{small}+n_{big}}\), and one for “stringy” scatter plots \(1-s_{stringy}\), which is only applied if the stringy value is higher than 0.95, to reduce the arbitrarily large clumpy scores that come from striated plots.
  5. Take the mean of the clumpy values over the between-cluster edges. If it is below 1, it is beneath the threshold that is considered clumpy, and the value is adjusted to 1.
  6. Clumpy 2 returns \(1-\frac{1}{mean(clumpy_i)}\)
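Steps 1 to 3 of this procedure can be sketched as follows. This is an illustrative Python sketch, not the package's R code: the function name, the (i, j, length) edge layout and the toy tree are assumptions, and the per-edge scoring of steps 4 to 6 is left out.

```python
def split_clusters(n_vertices, edges):
    """Split MST edges at the largest jump in sorted edge length.

    edges: list of (i, j, length) tuples.
    Returns (clusters, between_edges), where clusters is a list of vertex lists.
    """
    ordered = sorted(edges, key=lambda e: e[2])                        # step 1
    if len(ordered) < 2:
        return [list(range(n_vertices))], []
    gaps = [ordered[k + 1][2] - ordered[k][2] for k in range(len(ordered) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)                  # step 2
    within, between = ordered[:cut + 1], ordered[cut + 1:]

    # step 3: connected components of the remaining edges, via union-find
    parent = list(range(n_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for i, j, _ in within:
        parent[find(i)] = find(j)
    clusters = {}
    for v in range(n_vertices):
        clusters.setdefault(find(v), []).append(v)
    return list(clusters.values()), between

# Two tight groups of three vertices joined by one long edge.
toy_mst = [(0, 1, 1.0), (1, 2, 1.1), (3, 4, 0.9), (4, 5, 1.2), (2, 3, 10.0)]
clusters, between = split_clusters(6, toy_mst)
print(sorted(sorted(c) for c in clusters))   # [[0, 1, 2], [3, 4, 5]]
print(between)                               # [(2, 3, 10.0)]
```

Cutting at the single largest gap means every edge longer than that gap is treated as a between-cluster edge, so more than two clusters can result when several long edges sit above the jump.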

With this calculation, we generate the clumpy2 measure which is compared to the original clumpy measure in the figure below.

Figure: A Visual Table Comparison of Clumpy and Clumpy 2

Here we can see the improvements made on the clumpy measure in both distribution from 0 to 1 and ordering. The measure is more spread out, and so values range more accurately from 0 to 1. More importantly, the measure does a better job of ordering the scatter plots. On the original clumpy measure the “clusters” scatter plot was next to last; on the clumpy2 measure “clusters” is identified as the most clumpy scatter plot. Clumpy 2 also has a penalty for uneven clusters (to avoid being large due to a small collection of outliers) and for clusters created arbitrarily due to discreteness (such as vlines), in order to better align with the human interpretation of clumpy. With these changes, the stronger performance of clumpy2 is apparent in this visual table.

Software implementation

Installation

Data sets

Functions

Scagnostics functions

Drawing functions

Summary functions

Tests

Examples

Collections of time series

GOAL: Use scagnostics to find differences in shape between groups. Here we want to first use features to describe a time series, and then choose pairs of features where there is the biggest difference between groups according to a scagnostic.

A paragraph describing the compenginets data

Analysis notes:

Compare two sets of time series

This analysis compares the features of macroeconomic and microeconomic series, using scagnostics. The goal of the comparison is to compare shapes, not necessarily centres of groups as might be done in LDA or other machine learning methods.

Here, just a small set of features is examined (because the code is fragile), but what emerges as interesting is the difference between curvature and trend strength (Figure 13). Microeconomic series tend to have high values on trend strength, and a range of values on curvature. In comparison, macroeconomic series tend to have near constant average values on curvature, and highly varied values on trend strength.

Plotting a few series actually suggests that the microeconomic series contain lots of micro structure, which might be what we should expect (Figure 14). Interestingly the trend strength seems to pick up the jaggies!

Figure 13: Interesting differences between two groups of time series detected by scagnostics. The time series are described by time series features, in order to handle different length series. Scagnostics are computed on these features separately for each set to explore for shape differences.

Figure 14: Selection of series from the two groups, macroeconomics and microeconomics. The difference is in the jagginess of the two series.

Black hole mergers

This is a simulated dataset that contains posterior samples describing an observed gravitational wave signal from a black hole merger in terms of position in the sky (ra, dec, distance), time of the event (time), the black hole properties (masses m1 and m2; spin related properties alpha, theta_jn, chi_tot, chi_eff, chi_p) and the additional nuisance parameters psi (polarisation angle) and phi_jl (orbital phase). There are thus 13 variables, and it is still feasible to look at a complete SPLOM, providing a good cross check of the scagnostics.

The data contains 9998 posterior samples. Without binning, it takes too long to compute the scagnostics on such a large number of observations. For our purpose a much smaller sample is sufficient, and we randomly sample 200 observations before computing the scagnostics.

Combinations that stand out: time-ra, dec-ra and dec-time have low convex and high skinny values. dec-ra and time-ra also have higher splines than dcor (both high, indicating a non-linear functional relation), while m1-m2, dec-time and chi_p-chi_tot have higher dcor than splines (still both high: m1-m2 and chi_p-chi_tot are linear relations with noise, and dec-time is a strong association but not a function). The final plot, showing clumpy against skewed, shows that clumpy is not really doing what we expect, since we would expect much more structure in clumpy: in particular, plots with time break up into two well separated groups, plots with ra break up into three separated groups, and some other variables introduce less pronounced separation between groups. We included time-alpha as one example; it has a clumpy value of 0.9 and a skewed value of 0.7.

AFL player statistics

The Australian Football League Women’s (AFLW) is the national semi-professional Australian rules football league for female players. Here we will analyse data sourced from the official AFL website with information on the 2020 season, in which the league had 14 teams and 1932 players. There are 68 variables, 38 of which are numeric. The others are categorical, like the players' names or match ids, which would not be used in scagnostic calculations. These numeric variables are recorded per player per game, and a description of each variable in this data set can be found in the appendix. With 33 numeric variables, there are 528 possible scatterplots to make. This is much more than we could possibly plot ourselves, and so we can use the scagnostics to identify which might be interesting to examine ourselves. The figure below displays 5 scatter plots that were identified as having a particularly high or low value on a scagnostic, or an unusual combination of two or more scagnostics. In addition to these 5, there is a 6th plot that is included to display what a middling value on almost all of the scagnostics looks like. You may like to test your scagnostics knowledge by guessing which plot is the middling value on all the scagnostics.
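The count of 528 is the number of ways to choose 2 variables out of 33. A short check in Python, where the function name is illustrative:

```python
from math import comb

def n_scatterplots(p):
    """Number of distinct pairwise scatter plots among p variables: p * (p - 1) / 2."""
    return comb(p, 2)

print(n_scatterplots(33))   # 528
```

The same count grows quickly with p, which is why inspecting every plot by eye stops being practical well before the data become truly high dimensional.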

Plots 1 to 5 are examples of unusual combinations of scagnostics. Plot 6 is an example of a scatter plot that had moderate values across all the scagnostics and was mostly picked at random. We can present Plot 6 alongside two other scatter plots that were selected arbitrarily (the same way we would if we were going to try and do EDA ourselves) to give an idea of what we would get if we arbitrarily selected variables to plot.

We could plot scatter plots like this all day, but most of the scatter plots in this data set look something like this, and when compared to Plots 1 to 5 we can see that the extreme values on the scagnostic measurements identify atypical scatter plots. While it is interesting to know that scagnostics can pick out interesting scatter plots, we still need to know how to use them. Typically the plots with strange scagnostic combinations are identified using an interactive SPLOM, but for the sake of space, we are only going to show the specific scatter plots of the SPLOM that led to the selection of Plots 1, 2, and 5. Let's start with Plot 1.

Plot 1

We identified Plot 1 as interesting as it returned high values on both Outlying and Skewed. This indicates that even after removing outliers, the data was still disproportionately spread out, a trend we can see very clearly in the identified scatter plot.

Plot 2

Plot 2 scored very highly on all the association measures, which indicates a strong relationship between the two variables. The three association measures are typically strongly correlated with each other: scatter plots that stay within the large mass in the centre have a linear relationship, while scatter plots that deviate from it typically have some strong non-linear relationship. Unfortunately no such deviation appears here, so none of these variable pairs have strong non-linear relationships; rather, our highest value on the association measures indicates the linear relationship between total possessions and disposals. Total possessions is the number of times the player has the ball and disposals is the number of times the player gets rid of the ball legally, so the high correlation makes sense: this is a professional league, so most of the players succeed in getting rid of the ball legally.

Plot 5

This plot is an excellent example of what new information we can learn from a unique pairwise relationship. This scatter plot is separate from the mass of pairwise relationships because it was high on striated_2 and low on outlying, which tells us most of the points are at right angles and a little spread out (but not enough for a high outlying value). This plot tells us something interesting about the physicality of the players. If a specific sports statistic is related to position, we would see a relationship with a lower triangular structure similar to that of Plot 4; however, this plot does not have a lower triangular structure, it has an L-shape. This means these statistics are not about position, but rather the physical abilities of the players. Hitouts measure the number of times the player punches the ball after the referee throws it back into play, while bounces have to be done while running, and are typically done by fast players. The L-shape tells us that players who do one very rarely perform the other. The moderate spread along both of the statistics tells us these are both somewhat specialised skills, and the players who specialise in one do not specialise in the other, i.e. in AFL the tallest player in the team is rarely the fastest.

Splines Work

Var1                        Var2                           splines
totalPossessions            disposals                      0.94
clearances.totalClearances  clearances.stoppageClearances  0.88
goalAccuracy                goals                          0.83
metresGained                kicks                          0.77
dreamTeamPoints             disposals                      0.74
disposals                   kicks                          0.72
dreamTeamPoints             totalPossessions               0.72
totalPossessions            uncontestedPossessions         0.68
dreamTeamPoints             kicks                          0.67
uncontestedPossessions      disposals                      0.66

Figure 15: Scatterplots with high values on the splines scagnostic. Mouseover to examine the players relative to the statistics.

Figure 15 shows three scatterplots that score highly on the splines scagnostic. Each of these shows a relatively strong monotonic relationship between the two variables. In the interactive version of the plot, mouse over reveals some high-performing players, e.g. Anne Hatchard has a lot of possessions, disposals and kicks, and Kaitlyn Ashmore kicked 4 goals in a match with 100% accuracy.

NOTE: Each player is represented multiple times here, I think. The stats are per game. Maybe it is better to aggregate for each player and re-do the statistics?

Figure: Some players tend to kick the ball, even when challenged, whereas others more often use handball for disposals.

World Bank Development Indicators

The World Bank delivers a lot of development indicators (World Bank 2021), for many countries and multiple years. The sheer volume of indicators, in addition to substantial missing values, creates a barrier to analysis. This is a good example of where scagnostics can be used to identify pairs of indicators with interesting relationships.

Here we have downloaded indicators from 2018 for a number of countries. First, the data needs some pre-processing, to remove variables which have mostly missing values, and countries which have mostly missing values. The scagnostics will be calculated on the pairwise complete data, so it is ok to leave a few sporadic missing values. At the end of the pre-processing, there are 20 indicators for 79 countries.
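The pre-processing described here can be sketched in Python as follows. This is only an illustration under stated assumptions: the cut-off of 0.5 for "mostly missing", the nested-dictionary layout, and the country and indicator names are all hypothetical and are not taken from the analysis itself.

```python
def frac_missing(values):
    """Fraction of None entries; an empty collection counts as fully missing."""
    values = list(values)
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)

def drop_mostly_missing(table, max_missing=0.5):
    """table: {country: {indicator: value or None}}.
    Drops indicators, then countries, whose missing fraction exceeds max_missing."""
    indicators = sorted({ind for row in table.values() for ind in row})
    keep = [ind for ind in indicators
            if frac_missing(row.get(ind) for row in table.values()) <= max_missing]
    return {country: {ind: row.get(ind) for ind in keep}
            for country, row in table.items()
            if frac_missing(row.get(ind) for ind in keep) <= max_missing}

def pairwise_complete(table, a, b):
    """(a, b) value pairs for the rows where both indicators are present."""
    return [(row[a], row[b]) for row in table.values()
            if row.get(a) is not None and row.get(b) is not None]

# Hypothetical toy data: four countries, three indicators.
raw = {
    "A": {"gdp": 1.0, "pop": 2.0, "rare": None},
    "B": {"gdp": 2.0, "pop": None, "rare": None},
    "C": {"gdp": None, "pop": None, "rare": None},
    "D": {"gdp": 4.0, "pop": 5.0, "rare": 1.0},
}
clean = drop_mostly_missing(raw)
print(sorted(clean))                            # ['A', 'B', 'D']
print(pairwise_complete(clean, "gdp", "pop"))   # [(1.0, 2.0), (4.0, 5.0)]
```

Country B survives the filter despite one missing value, and is simply skipped for any pair that involves the missing indicator, which is what calculating on pairwise complete data amounts to.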

Figure 16: Most of the pairs of indicators exhibit outliers or are stringy. There is one pair that has clumpy as the highest value. There are numerous pairs that have a highest value on convex.

Summary

Appendix

AFLW Data Variable Descriptions

Anscombe, F. J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1): 17–21. https://doi.org/10.1080/00031305.1973.10478966.
Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.
Dang, Tuan Nhon, and Leland Wilkinson. 2014. “ScagExplorer: Exploring Scatterplots by Their Scagnostics.” In 2014 IEEE Pacific Visualization Symposium, 73–80. https://doi.org/10.1109/PacificVis.2014.42.
Diaconis, Persi, and David Freedman. 1984. “Asymptotics of Graphical Projection Pursuit.” The Annals of Statistics 12 (3): 793–815. http://www.jstor.org/stable/2240961.
Fulcher, Ben D, Carl H Lubba, Sarab S Sethi, and Nick S Jones. 2020. “A Self-Organizing, Living Library of Time-Series Data.” Scientific Data 7 (1): 213–13.
Grimm, Katrin. 2016. “Kennzahlenbasierte Grafikauswahl.” Doctoral thesis, Universität Augsburg.
Laa, U., and D. Cook. 2020. “Using Tours to Visually Investigate Properties of New Projection Pursuit Indexes with Application to Problems in Physics.” Computational Statistics 35: 1171–1205. https://doi.org/10.1007/s00180-020-00954-8.
Laa, U., D. Cook, and S. Lee. 2020. “Burning Sage: Reversing the Curse of Dimensionality in the Visualization of High-Dimensional Data.” arXiv: Computation.
Laa, Ursula, Hadley Wickham, Dianne Cook, and Heike Hofmann. 2020. “Binostics: Computing Scagnostics Measures in r and c++.” https://github.com/uschiLaa/paper-binostics.
Lee, Stuart, Ursula Laa, and Dianne Cook. 2020. “Casting Multiple Shadows: High-Dimensional Interactive Data Visualisation with Tours and Embeddings.” https://arxiv.org/abs/2012.06077.
Locke, Steph, and Lucy D’Agostino McGowan. 2018. datasauRus: Datasets from the Datasaurus Dozen. https://CRAN.R-project.org/package=datasauRus.
Pateiro-Lopez, Beatriz, Alberto Rodriguez-Casal, and. 2019. Alphahull: Generalization of the Convex Hull of a Sample of Points in the Plane. https://CRAN.R-project.org/package=alphahull.
Tukey, John. 1988. “The Collected Works of John W. Tukey.” In, edited by William S. Cleveland, 411, 427, 433. Chapman; Hall/CRC.
Wang, Yunhai, Zeyu Wang, Tingting Liu, Michael Correll, Zhanglin Cheng, Oliver Deussen, and Michael Sedlmair. 2020. “Improving the Robustness of Scagnostics.” IEEE Transactions on Visualization and Computer Graphics 26 (1): 759–69.
Wilkinson, L., A. Anand, and R. Grossman. 2005. “Graph-Theoretic Scagnostics.” In IEEE Symposium on Information Visualization, 2005. INFOVIS 2005., 157–64.
Wilkinson, Leland, and Graham Wills. 2008. “Scagnostics Distributions.” Journal of Computational and Graphical Statistics 17 (2): 473–91.
World Bank. 2021. “World Development Indicators. The World Bank Group.” https://databank.worldbank.org/source/world-development-indicators.

References

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".